Appendix A Additional results. This appendix section shows additional results and corresponding plots to support the insights …
Section A.2 shows results using a chat-style verbalized numeric … Section A.3 shows results on four extra benchmark tasks made available with … Finally, Section A.5 presents and discusses results on feature … In this section, we evaluate risk score calibration on the income prediction task across different subpopulations, as is typically done as part of a fairness audit. Figures A1-A2 show group-conditional calibration curves for all models on the ACSIncome task, evaluated on three subgroups specified by the race attribute in the ACS data. We show the three race categories with the largest representation. The 'Mixtral 8x22B' and 'Yi 34B' models shown are the worst offenders: samples belonging to the 'Black' population consistently receive lower scores for the same positive label probability than samples from the 'Asian' or 'White' populations. On average, the 'Mixtral 8x22B (it)' model classifies a Black individual with a … In fact, this score bias can be reversed for some base models, which overestimate scores for Black individuals compared with other subgroups.
- Oceania > New Zealand (0.04)
- North America > United States > California (0.04)
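The excerpt above audits group-conditional calibration, i.e. whether a given risk score corresponds to the same observed positive-label rate for every race group. Below is a minimal sketch of how such a curve can be computed per subgroup; the synthetic dataframe and column names are assumptions, not the paper's actual ACSIncome setup.

```python
import numpy as np
import pandas as pd

def group_calibration_curve(scores, labels, n_bins=10):
    """Return (mean predicted score, observed positive rate) per score bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(scores, bins[1:-1])  # bin index in [0, n_bins) for each sample
    curve = []
    for b in range(n_bins):
        mask = idx == b
        if mask.sum() == 0:
            continue
        curve.append((scores[mask].mean(), labels[mask].mean()))
    return np.array(curve)

# Hypothetical dataframe: one row per test example with the model's verbalized
# risk score, the true label, and the race attribute (synthetic placeholder data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "score": rng.random(1000),
    "label": rng.integers(0, 2, 1000),
    "race": rng.choice(["White", "Black", "Asian"], 1000),
})

for race, grp in df.groupby("race"):
    curve = group_calibration_curve(grp["score"].to_numpy(), grp["label"].to_numpy())
    print(race)
    print(curve)  # diverging curves across groups indicate group-dependent miscalibration
```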
Detecting Statistically Significant Fairness Violations in Recidivism Forecasting Algorithms
Machine learning algorithms are increasingly deployed in critical domains such as finance, healthcare, and criminal justice [1]. The increasing popularity of algorithmic decision-making has stimulated interest in algorithmic fairness within the academic community. Researchers have introduced various fairness definitions that quantify disparities between privileged and protected groups, that use causal inference to determine the impact of race on model predictions, and that test the calibration of the model's probability predictions. The existing literature, however, does not provide a way to assess whether observed disparities between groups are statistically significant or merely due to chance. This paper introduces a rigorous framework for testing the statistical significance of fairness violations by leveraging k-fold cross-validation [2] to generate sampling distributions of fairness metrics. The resulting statistical tests can identify statistically significant violations of fairness metrics based on disparities between predicted and actual outcomes, model calibration, and causal inference techniques [1]. We demonstrate this approach by testing recidivism forecasting algorithms trained on data from the National Institute of Justice. Our findings reveal that machine learning algorithms used for recidivism forecasting exhibit statistically significant bias against Black individuals under several fairness definitions, while exhibiting no bias, or bias against White individuals, under other definitions. The results underscore the importance of rigorous and robust statistical testing when evaluating algorithmic decision-making systems.
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > India (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Law (1.00)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (0.88)
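The abstract above describes using k-fold cross-validation to obtain a sampling distribution of a fairness metric and then testing whether the observed disparity is statistically significant. Below is a minimal sketch of that idea, assuming the false positive rate gap between two groups as the metric and a one-sample t-test across folds; the paper's actual metrics, models, and tests may differ.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def fpr(y_true, y_pred):
    """False positive rate: share of true negatives predicted positive."""
    neg = y_true == 0
    return y_pred[neg].mean() if neg.any() else np.nan

def fairness_gap_per_fold(X, y, group, n_splits=10, seed=0):
    """FPR(group 1) - FPR(group 0) measured on each held-out fold."""
    gaps = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        yt, g = y[test_idx], group[test_idx]
        gaps.append(fpr(yt[g == 1], pred[g == 1]) - fpr(yt[g == 0], pred[g == 0]))
    return np.array(gaps)

# Synthetic stand-in for the recidivism data: features X, outcomes y, group membership.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
group = rng.integers(0, 2, 2000)
y = (X[:, 0] + 0.5 * group + rng.normal(size=2000) > 0).astype(int)

gaps = fairness_gap_per_fold(X, y, group)
t, p = stats.ttest_1samp(gaps, 0.0)  # H0: no FPR disparity between the two groups
print(f"mean gap={gaps.mean():.3f}, t={t:.2f}, p={p:.3f}")
```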
Popular LLMs Amplify Race and Gender Disparities in Human Mobility
As large language models (LLMs) are increasingly applied in areas influencing societal outcomes, it is critical to understand their tendency to perpetuate and amplify biases. This study investigates whether LLMs exhibit biases in predicting human mobility -- a fundamental human behavior -- based on race and gender. Using three prominent LLMs -- GPT-4, Gemini, and Claude -- we analyzed their predictions of visitations to points of interest (POIs) for individuals, relying on prompts that included names with and without explicit demographic details. We find that LLMs frequently reflect and amplify existing societal biases. Specifically, predictions for minority groups were disproportionately skewed, with these individuals being significantly less likely to be associated with wealth-related points of interest (POIs). Gender biases were also evident, as female individuals were consistently linked to fewer career-related POIs compared to their male counterparts. These biased associations suggest that LLMs not only mirror but also exacerbate societal stereotypes, particularly in contexts involving race and gender.
- North America > United States > Massachusetts (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Transportation (0.46)
- Government > Regional Government (0.46)
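The study above elicits LLM predictions of visited points of interest from prompts that do and do not carry explicit demographic details. Below is a minimal sketch of such a paired-prompt audit; query_llm is a hypothetical stand-in for a GPT-4, Gemini, or Claude client, and the POI categories and names are illustrative placeholders, not the paper's taxonomy.

```python
from collections import Counter

# Illustrative wealth-related POI categories (placeholders).
WEALTH_POIS = {"country club", "art gallery", "fine dining"}

def query_llm(prompt: str) -> list[str]:
    """Hypothetical stand-in for an LLM call returning predicted POI categories."""
    return ["coffee shop", "fine dining"]  # placeholder; replace with a real client call

def audit(names_by_group: dict[str, list[str]]) -> dict[str, float]:
    """Share of wealth-related POIs the model predicts per demographic group."""
    rates = {}
    for group, names in names_by_group.items():
        counts = Counter()
        for name in names:
            # Same task, asked with and without an explicit demographic cue.
            prompts = [
                f"List 5 places {name} is likely to visit this week.",
                f"List 5 places {name}, a {group} individual, is likely to visit this week.",
            ]
            for prompt in prompts:
                pois = query_llm(prompt)
                counts["wealth"] += sum(p.lower() in WEALTH_POIS for p in pois)
                counts["total"] += len(pois)
        rates[group] = counts["wealth"] / max(counts["total"], 1)
    return rates

print(audit({"group A": ["Name A"], "group B": ["Name B"]}))
```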
Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios
Mirza, Vishal, Kulkarni, Rahul, Jadhav, Aakanksha
Recent advancements in Large Language Models (LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs, a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.
- North America > United States > New York (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling
Rutherford, Saige, Wolfers, Thomas, Fraza, Charlotte, Harrnet, Nathaniel G., Beckmann, Christian F., Ruhe, Henricus G., Marquand, Andre F.
Reference classes in healthcare establish healthy norms, such as pediatric growth charts of height and weight, and are used to chart deviations from these norms which represent potential clinical risk. How the demographics of the reference class influence clinical interpretation of deviations is unknown. Using normative modeling, a method for building reference classes, we evaluate the fairness (racial bias) in reference models of structural brain images that are widely used in psychiatry and neurology. We test whether including race in the model creates fairer models. We predict self-reported race using the deviation scores from three different reference class normative models, to better understand bias in an integrated, multivariate sense. Across all of these tasks, we uncover racial disparities that are not easily addressed with existing data or commonly used modeling techniques. Our work suggests that deviations from the norm could be due to demographic mismatch with the reference class, and assigning clinical meaning to these deviations should be done with caution. Our approach also suggests that acquiring more representative samples is an urgent research priority.
- Europe > Netherlands > Gelderland > Nijmegen (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (11 more...)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
- (2 more...)
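The abstract above derives deviation scores from normative reference models and asks whether self-reported race can be predicted from them. Below is a minimal sketch under simplifying assumptions (a per-feature Gaussian reference in place of a full normative model, synthetic data, and a logistic-regression probe); above-chance accuracy would suggest that deviations from the norm partly encode demographic mismatch with the reference class rather than clinical risk alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: brain-structure features for a reference sample and a test sample.
ref = rng.normal(size=(500, 20))            # reference class used to define the "norm"
test = rng.normal(loc=0.2, size=(300, 20))  # test subjects to be scored against the norm
race = rng.integers(0, 2, 300)              # encoded self-reported race labels

# Simplified normative "model": per-feature mean/std of the reference class;
# deviations are z-scores of each test subject relative to that reference.
mu, sigma = ref.mean(axis=0), ref.std(axis=0)
deviations = (test - mu) / sigma

# Probe: can self-reported race be predicted from the deviation scores?
acc = cross_val_score(LogisticRegression(max_iter=1000), deviations, race, cv=5).mean()
print(f"cross-validated accuracy predicting race from deviations: {acc:.2f}")
```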
Bias in Generative AI
Zhou, Mi, Abhishek, Vibhanshu, Derdenger, Timothy, Kim, Jaymo, Srinivasan, Kannan
This study analyzed images generated by three popular generative artificial intelligence (AI) tools - Midjourney, Stable Diffusion, and DALLE 2 - representing various occupations to investigate potential bias in AI generators. Our analysis revealed two overarching areas of concern in these AI generators, including (1) systematic gender and racial biases, and (2) subtle biases in facial expressions and appearances. Firstly, we found that all three AI generators exhibited bias against women and African Americans. Moreover, we found that the evident gender and racial biases uncovered in our analysis were even more pronounced than the status quo when compared to labor force statistics or Google images, intensifying the harmful biases we are actively striving to rectify in our society. Secondly, our study uncovered more nuanced prejudices in the portrayal of emotions and appearances. For example, women were depicted as younger with more smiles and happiness, while men were depicted as older with more neutral expressions and anger, posing a risk that generative AI models may unintentionally depict women as more submissive and less competent than men. Such nuanced biases, by their less overt nature, might be more problematic as they can permeate perceptions unconsciously and may be more difficult to rectify. Although the extent of bias varied depending on the model, the direction of bias remained consistent in both commercial and open-source AI generators. As these tools become commonplace, our study highlights the urgency to identify and mitigate various biases in generative AI, reinforcing the commitment to ensuring that AI technologies benefit all of humanity for a more inclusive future.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > United States > New York (0.04)
- Education > Educational Setting (0.93)
- Banking & Finance > Economy (0.67)
Facial recognition surveillance in São Paulo could worsen racism
São Paulo, Brazil – As the city of São Paulo prepares to roll out thousands of surveillance cameras with facial recognition, experts are raising concerns that the indiscriminate use of this technology in the Brazilian megalopolis could exacerbate problems such as structural racism and inequality, while also posing risks to data privacy and cybersecurity. The Smart Sampa project is the latest in a series of initiatives involving modern surveillance techniques in various Brazilian states. It is significant due to the sheer size of the population it will impact: São Paulo, the most populous city in the Southern Hemisphere, is home to 12 million people. The project aims to roll out a single video surveillance platform that integrates and supports the operations of emergency and traffic services, the city's public transport network, and police forces. By 2024, up to 20,000 cameras will be installed, and an equal number of third-party and private cameras will be integrated into the network.
- South America > Brazil > São Paulo > São Paulo (0.35)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.06)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.05)
- (3 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
Evaluate underdiagnosis and overdiagnosis bias of deep learning model on primary open-angle glaucoma diagnosis in under-served patient populations
Lin, Mingquan, Xiao, Yuyun, Hou, Bojian, Wanyan, Tingyi, Sharma, Mohit Manoj, Wang, Zhangyang, Wang, Fei, Van Tassel, Sarah, Peng, Yifan
In the United States, primary open-angle glaucoma (POAG) is the leading cause of blindness, especially among African American and Hispanic individuals. Deep learning has been widely used to detect POAG from fundus images, as its performance is comparable to or even surpasses diagnosis by clinicians. However, human bias in clinical diagnosis may be reflected and amplified in the widely-used deep learning models, thus impacting their performance. Biases may cause (1) underdiagnosis, increasing the risk of delayed or inadequate treatment, and (2) overdiagnosis, which may increase individuals' stress and fear, diminish their well-being, and lead to unnecessary or costly treatment. In this study, we examined underdiagnosis and overdiagnosis when applying deep learning to POAG detection based on the Ocular Hypertension Treatment Study (OHTS) from 22 centers across 16 states in the United States. Our results show that the widely-used deep learning model can underdiagnose or overdiagnose underserved populations. The most underdiagnosed group is the younger (< 60 yrs) female group, and the most overdiagnosed group is the older (>= 60 yrs) Black group. Biased diagnosis through traditional deep learning methods may delay disease detection and treatment and create burdens for under-served populations, thereby raising ethical concerns about using deep learning models in ophthalmology clinics.
- North America > United States > California > Los Angeles County > Los Angeles (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
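The abstract above reports group-specific underdiagnosis and overdiagnosis. Below is a minimal sketch of the corresponding group-conditional error rates, treating the false negative rate as an underdiagnosis proxy and the false positive rate as an overdiagnosis proxy; the column names and toy rows are illustrative, not the OHTS schema.

```python
import pandas as pd

def diagnosis_bias_by_group(df: pd.DataFrame, group_cols=("sex", "age_band")) -> pd.DataFrame:
    """Per-subgroup underdiagnosis (FNR) and overdiagnosis (FPR) of a binary classifier."""
    def rates(g: pd.DataFrame) -> pd.Series:
        pos, neg = g[g.label == 1], g[g.label == 0]
        return pd.Series({
            "underdiagnosis_fnr": (pos.pred == 0).mean() if len(pos) else float("nan"),
            "overdiagnosis_fpr": (neg.pred == 1).mean() if len(neg) else float("nan"),
            "n": len(g),
        })
    return df.groupby(list(group_cols)).apply(rates)

# Illustrative rows: true POAG label, model prediction, and demographic attributes.
df = pd.DataFrame({
    "label":    [1, 1, 0, 0, 1, 0, 1, 0],
    "pred":     [1, 0, 0, 1, 1, 0, 0, 1],
    "sex":      ["F", "F", "M", "M", "F", "M", "F", "M"],
    "age_band": ["<60", "<60", ">=60", ">=60", ">=60", "<60", "<60", ">=60"],
})
print(diagnosis_bias_by_group(df))
```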
Rite Aid used facial recognition in secret across hundreds of its stores
Drugstore chain Rite Aid secretly deployed facial recognition software across a network of security cameras in hundreds of locations in the US, according to a new investigation from Reuters published on Tuesday. The company had been doing so for more than eight years, and it only recently stopped using the technology, it told Reuters, following a "larger industry conversation" around facial recognition and the grave concern over privacy risks and racial discrimination it presents. Yet, Reuters says Rite Aid initially defended its use of facial recognition as a deterrent against theft and violent crime, having nothing to do with race. The investigation found that not to be entirely true. "In areas where people of color, including Black or Latino residents, made up the largest racial or ethnic group, Reuters found that stores were more than three times as likely to have the technology," the report reads.
- North America > United States > New York (0.06)
- North America > United States > California > Los Angeles County > Los Angeles (0.06)
- North America > United States > California > Alameda County > Oakland (0.06)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Retail (0.96)
- Health & Medicine > Consumer Health (0.96)